CG30403 KO RNAseq analysis¶

This is CG30403 KO RNAseq analysis, the samples are sequenced in 20231020. In this time TE analysis is included. 3 KO samples(P16,P17,P18) and 3 control samples(P19,P20,P21)

kb index¶

I tried to understand how to generate "dmel-all-chromosome-r6.42_and_drorep25.fa" file and "dmel-all-genes_and_tes.gtf" (path: /rhome/mninova/bigdata/droso_scRNAseq/index), Then I use kb-python to run index them and generate t2g.txt file.

In [ ]:
IDX=/rhome/kwu096/bigdata/20240223_CGKO_analysis/00_kallisto_analysis/kb_index_test/dmel-all-chromosome-r6.42_and_drorep25.idx
T2G=/rhome/kwu096/bigdata/20240223_CGKO_analysis/00_kallisto_analysis/kb_index_test/t2g.txt
FASTA=/rhome/kwu096/bigdata/20240223_CGKO_analysis/00_kallisto_analysis/kb_index_test/dro_cDNA_rep.fa
GENOME=/rhome/kwu096/bigdata/20240223_CGKO_analysis/00_kallisto_analysis/dmel-all-chromosome-r6.42_and_drorep25.fa
GTF=/rhome/kwu096/bigdata/20240223_CGKO_analysis/00_kallisto_analysis/dmel-all-genes_and_tes.gtf

kb ref -i $IDX -g $T2G -f1 $FASTA $GENOME $GTF

Then I found this will generate a empty index file, it shows "WARNING [main] Gene FBgn0013687 has no transcripts. The entire gene will be marked as a transcript and an exon with ID FBgn0013687”. Currently I cannot solve this problem, so I just use pre-exist indexed file in /rhome/mninova/bigdata/droso_scRNAseq/index to do downstream works

Kallisto quant¶

Here is the key code:

R1 and R2 are sample reads.

To know whether those reads are stranded(--fr) or reverse stranded(--rf), i checked previous analysised STAR alignment result, checking STAR output file with suffix “ReadsPerGene.out.tab” and I am sure the reads are reverse strand.(For more information, check https://pachterlab.github.io/kallisto/manual)

-b option specifies the number of bootstrap samples. Bootstrapping is a statistical method used to estimate the variance of the estimated transcript abundances. The higher the number of bootstrap samples, the more accurate the confidence intervals will be, but this also requires more computational resources and time.

In [ ]:
module load kallisto

IN_DIR=/bigdata/ninovalab/shared/KW01-06_CG30403_KO_polyA_RNA/00_FASTQ
OUT_DIR=/rhome/kwu096/bigdata/20240223_CGKO_analysis/00_kallisto_analysis/result
INDEX=/rhome/kwu096/bigdata/20240223_CGKO_analysis/00_kallisto_analysis/dm6_with_RepBase_kbref_index

for R1 in $IN_DIR/*READ1-Sequences.txt.gz; do
        BASENAME=$(basename $R1 -READ1-Sequences.txt.gz)
        OUTPUT_NAME=$(echo $BASENAME | awk -F'-' '{print $4}')
        R2=$IN_DIR/${BASENAME}-READ2-Sequences.txt.gz
        mkdir -p ${OUT_DIR}/${OUTPUT_NAME}
        kallisto quant -i $INDEX -b 30 --rf-stranded -o ${OUT_DIR}/${OUTPUT_NAME} $R1 $R2
done

Tximport¶

First is use Tximport package to summarize kallisto quant result

In [ ]:
library(tximport)
library(tidyverse)

##Here i want to get the path to access the abundance.tsv file, which generated from Kaliisio quant
base_dir <- "/rhome/kwu096/bigdata/20240223_CGKO_analysis/00_kallisto_analysis/result"
folders <- paste0("P", 16:21) # This creates a vector "P16", "P17", ..., "P21"
abundance_paths <- lapply(folders, function(folder) {
  file.path(base_dir, folder, "abundance.tsv")
})
abundance_paths <- unlist(abundance_paths)
##Rename the column name
names(abundance_paths) <- paste0("P",16:21)

#data <- read.table("/rhome/kwu096/bigdata/20240223_CGKO_analysis/00_kallisto_analysis/t2g.txt", sep = "\t",header = FALSE) #This will raise error "line 1896 did not have 8 elements, this is because there is a \t in this line"
data <- read.table("/rhome/kwu096/bigdata/20240223_CGKO_analysis/00_kallisto_analysis/t2g.txt", sep = "\t", header = FALSE, quote = "", col.names = c("FBtr", "FBgn", "GeneSymbol", "Transcript", "Chr", "Start", "End", "Strand"))

#Prepare tx data, this will be used in tximport
tx <- data[c("FBtr", "GeneSymbol")]
tx <- as_tibble(tx)

txi_gene <- tximport(abundance_paths,
                     type = "kallisto",
                     tx2gene = tx, # Make sure this is defined
                     txOut = FALSE, # If FALSE then the output will be gene name, otherwise transcript ID
                     countsFromAbundance = "lengthScaledTPM", ##Kalliso result is TPM
                     ignoreTxVersion = TRUE)
txi_gene$counts <- txi_gene$counts[rowSums(txi_gene$counts) != 0, ] # Remove no count gene

Deseq2 analysis¶

Then use DESeq2 to do differential expression analysis.

In [ ]:
metadata <- data.frame(
  id = c("control1", "control2", "control3", "CGKO1", "CGKO2", "CGKO3"),
  group = c("control", "control", "control", "CGKO", "CGKO", "CGKO")
)
metadata$group <- factor(metadata$group)
#metadata$group <- relevel(metadata$group, ref = "control")

library(DESeq2)
dds <- DESeqDataSetFromTximport(txi_gene,
                              colData = metadata,
                              design=~group)
dds <- DESeq(dds)
res <- results(dds)
res_df <- data.frame(res)

plotMA(res,ylim=c(-10,10))
upregulated_significant_genes <- res[!is.na(res$log2FoldChange) & res$log2FoldChange > 0 & !is.na(res$padj) & res$padj < 0.000000000000000000001, ]
up_gene <- rownames(upregulated_significant_genes)
gene_info <- data.frame(gene = up_gene,
                        lfc = res[up_gene, "log2FoldChange"], avg = res[up_gene, "baseMean"])
with(gene_info, text(avg, lfc, labels=gene, cex=0.5, pos=4, col="red"))

file_show.tiff

Then check whether up-regulated gene contains POGO and HetA

In [ ]:
> up_fold2_p <- res[res$log2FoldChange > 2 & !is.na(res$padj) & res$padj < 0.1, ]
> # Use cat() function to print rowname(genename) without quotes
> cat(rownames(up_fold2_p))
CG10934 CG13532 CG17104 CG17224 CG17264 CG42296 CG9525 GABA-B-R3 prom TE:DMHMR2 TE:HETA TE:POGO TE:QUASIMODO2-LTR_DM